Coding for DS and DM
R coding module

Lecture 8

Andrea Cappozzo
andrea.cappozzo@unimi.it
AndreaCappozzo
andreacappozzo.rbind.io

R packages

  • The objective of this lecture is to present the basic structure of an R package, together with a set of simple but powerful routines that you can use to build your own R packages.
  • I will showcase the relevant tools and, at the end, we will build an R package that can be used to decide the optimal date for a happy hour.

What is an R package?

  • Following the R Packages (2e) book, an R package is the fundamental unit of shareable R code: it bundles together code, data, documentation, and tests, and is easy to share with others.
  • Released versions of an R package are typically distributed through CRAN (the Comprehensive R Archive Network), whereas development versions are often hosted on GitHub or similar services.
  • R packages follow a standardized layout. Sticking to this template makes your life easier, because every package is organized in the same way.

What is an R package? (cont)

  • A bit of terminology now… A package is a directory of files that extend R, containing, at minimum, the files DESCRIPTION and NAMESPACE and an R/ directory.
  • A package is not a library: a library is a directory in which installed packages are stored (see .libPaths()).
  • Beware, maintaining and updating an R package can be an extremely time-consuming process…
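
To make the package/library distinction concrete, here is a minimal base R sketch. It only inspects the current R installation; the variable names are illustrative.

```r
# A library is a directory on disk that holds installed packages.
lib_dirs <- .libPaths()
lib_dirs

# Each sub-directory of a library is one installed package.
head(list.files(lib_dirs[1]))

# library() looks a package up in those libraries and attaches it
# to the search path; "stats" ships with every R installation.
library(stats)
stats_attached <- "package:stats" %in% search()
stats_attached
```

So library(pkg) reads as "attach the package pkg found in one of my libraries".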

Maybe I’ll become a theoretician. Nobody expects you to maintain a theorem.
-Doug Bates (Matrix and RcppEigen maintainer, lme4-author)

Peek at the desired product

Now we are going to develop an R package named statsAndBooze. The objective of this package is to find the optimal date for happy hour given a set of constraints.

library(statsAndBooze)
beer_dates <- parse_dates(
  dates = list(
    andrea = c("2024-11-27", "2024-11-28", "2024-11-29", "2024-11-30", "2024-12-01"), ## available from 27/11 to 1/12
    federico = c("2024-11-29", "2024-12-02") ## available on 2 days
  )
)
decide_happy_hour(beer_dates)
#> [1] "2024-11-29"

Let’s start from scratch…

  • We are going to use the R package devtools
  • The following code can be used to generate the skeleton of an empty R package named packageName:
library(devtools)
create_package("path/for/the/R/packageName")

For example, I’m going to run

library(devtools)
create_package("/Users/andrea/Documents/r_packages/statsAndBooze")
  • The chosen path should point to a non-existing directory, which will be created by create_package(). Do not store an R package inside another R package or a Git repo.

Let’s start from scratch… (cont)

  • The previous command should open a new RStudio session that contains the skeleton of an empty R package. We will explore its content in a couple of minutes.
  • You should also see a log message like
✔ Creating /Users/andrea/Documents/r_packages/statsAndBooze/.
✔ Setting active project to "/Users/andrea/Documents/r_packages/statsAndBooze".
✔ Creating 'R/'
✔ Writing 'DESCRIPTION'
Package: statsAndBooze
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R (parsed):
* First Last <first.last@example.com> [aut, cre] (YOUR-ORCID-ID)

Now we can analyze the log more precisely together!

Let’s start from scratch… (cont)

The following lists the content of the new directory:

  • .gitignore: Lists the files that Git should not track.
  • .Rbuildignore: Similar to .gitignore, but excludes files from the package build.
  • DESCRIPTION: Stores the metadata of your package (e.g., author, description, dependencies).
  • R/: The directory where R scripts go.
  • NAMESPACE: Declares the functions your package exports and the external functions it imports from other packages. DO NOT EDIT BY HAND.

For more details, check here.
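
To see how little is needed for the minimal layout described above, the following sketch builds it by hand in a temporary directory using base R only. It is an illustration, not what create_package() does internally; the name toyPkg is hypothetical.

```r
# Build a bare-minimum package layout by hand in a temporary directory.
# "toyPkg" is a hypothetical name used only for this illustration.
pkg_dir <- file.path(tempdir(), "toyPkg")
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)

# DESCRIPTION is a plain-text file in Debian Control Format (DCF).
write.dcf(
  list(
    Package = "toyPkg",
    Title = "What the Package Does (One Line, Title Case)",
    Version = "0.0.0.9000"
  ),
  file = file.path(pkg_dir, "DESCRIPTION")
)

# An empty NAMESPACE: nothing is exported or imported yet.
file.create(file.path(pkg_dir, "NAMESPACE"))

# Inspect the result.
skeleton <- list.files(pkg_dir)
skeleton
desc <- read.dcf(file.path(pkg_dir, "DESCRIPTION"))
desc[1, "Package"]
```

In practice, let create_package() generate these files: it also sets up .Rbuildignore and the RStudio project.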

Add Git for Version Control

  • Now that we’ve started a new R session, reload devtools if it’s not already loaded.
  • Use use_git() to initialize a Git repository inside the package directory.
  • In an interactive session, RStudio may prompt you to perform the first commit and restart. You should accept both.
  • After restarting, a Git panel will appear in RStudio!

Edit the DESCRIPTION

Now we can edit the DESCRIPTION file:

  • Title: Find the best day to have a beer!
  • Version: Refer to Semantic Versioning.
  • Authors@R: Add name, surname, email, and ORCID.
  • License: See Choose a License.
  • Description: We’ll fill this in at the end.

If you are co-authoring the package, list all authors and their roles in the Authors@R field. More details here.
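
As a sketch of what goes into the Authors@R and Version fields, the snippet below uses person() and package_version() from base R. The names and email are the placeholders from the default DESCRIPTION, plus a hypothetical second author; 0.1.0 is an assumed first release number.

```r
# Authors@R is R code: a vector of person() objects with roles.
# "aut" = author, "cre" = creator (i.e., the maintainer).
authors <- c(
  person("First", "Last",
         email = "first.last@example.com",
         role = c("aut", "cre")),
  person("Second", "Author", role = "aut")  # hypothetical co-author
)
format(authors, include = c("given", "family", "role"))

# Version numbers are compared component by component:
# the development version 0.0.0.9000 precedes a first release 0.1.0.
dev_version <- package_version("0.0.0.9000")
first_release <- package_version("0.1.0")
first_release > dev_version
```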

Package dependencies

  • As mentioned, we’re developing an R package to parse dates and intervals.
  • Working with dates can be a nightmare… So, we’ll use wrappers from the lubridate package.
  • When developing an R package, avoid calling library() inside package code: it changes the user’s search path and does not declare the dependency in DESCRIPTION. See details here and here.
  • Instead, use use_package("pkg") to declare dependencies.

Package dependencies (cont)

  • You should see the following output:
use_package("lubridate")
✔ Adding 'lubridate' to Imports field in DESCRIPTION
* Refer to functions with `lubridate::fun()`

This message indicates that when using lubridate functions, you should prefix them with lubridate::.

  • Check the DESCRIPTION file to see the change.
  • This applies to any function not included in the base package.
  • Question: How can you determine which package defines a function?
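
One possible way to approach the question above, sketched with base R tools only and using stats::median as a stand-in for any function you are unsure about:

```r
# find() reports where a name is found on the current search path.
where_attached <- utils::find("median")
where_attached

# The environment of a function is the namespace of its package.
defining_pkg <- environmentName(environment(stats::median))
defining_pkg

# With the pkg::fun() syntax the origin is explicit, and the
# package does not need to be attached with library().
middle <- stats::median(c(1, 5, 3))
middle
```

In an interactive session, the help page (?median) also shows the package name in its top-left corner.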

Interactive development

  • It’s easier to run initial tests in an interactive session before adding them to the package.
  • Our first goal is to define a function that takes a list of date strings and returns the parsed dates:
parse_dates(
  dates = list(
    andrea = "2024-11-29", ## exactly in this format
    federico = "2024-11-30" ## exactly in this format
  )
)
$andrea
[1] "2024-11-29" ## should have class = "Date"

$federico
[1] "2024-11-30"
  • Now it’s your turn! Try coding this function.

The first function

  • Now that we have sketched the skeleton of the function, we can add it to our package. First, create an R script in the R/ folder by running use_r("file_name").
  • For example, run:
use_r("parse.R")
✔ Setting active project to '/Users/andrea/Documents/r_packages/statsAndBooze'
* Modify 'R/parse.R'
* Call `use_test()` to create a matching test file
  • Copy the function definition (and only the function definition) into the new script. When referring to a lubridate function, prefix it with lubridate::.

The first function (cont)

  • The parse.R file might look like:
parse_dates <- function(dates) {
  lapply(dates, lubridate::as_date)
}
  • Save the file. To make parse_dates() available for testing, restart the R session and run load_all().
  • Let’s try it together! Remember to remove any unnecessary library calls from your interactive testing script.
  • If everything works, it’s a good time to commit.

R CMD check

  • Every time you modify your R package in a significant way (e.g., adding a new function), check that all parts are working.
  • The R CMD check command is the gold standard for package checks.
  • Run it from the Build panel or use the check() function in devtools.
  • Let’s try it!

Documentation

  • Our new function doesn’t have a help file:
devtools::load_all(".")
?parse_dates
No documentation for ‘parse_dates’ in specified packages and libraries: you could try ‘??parse_dates’
  • To create a help page, add a specially formatted comment above the function definition. We can use the roxygen2 package for this.
  • In RStudio, open parse.R, place your cursor within the function, and select Code → Insert Roxygen Skeleton.

Documentation (cont)

  • You should see something like:
#' Title
#'
#' @param dates 
#' @return
#' @export
#' @examples
parse_dates <- function(dates) {
  lapply(dates, lubridate::as_date)
}
  • Now we’ll fill in all the relevant parts.

Documentation (cont)

  • The final output should look like this:
#' Parse a list of strings into dates
#'
#' @details Please note that each date must be specified in the YYYY-MM-DD format.
#' @param dates A list of strings specifying dates.
#' @return A list of dates. Each string is converted to an object of class Date.
#' @export
#' @examples
#' parse_dates(list("2024-11-29", "2024-11-30"))
parse_dates <- function(dates) {
  lapply(dates, lubridate::as_date)
}

Documentation (cont)

  • Run document() to let roxygen2 generate the documentation.
  • Check the NAMESPACE file; you should see:
## Generated by roxygen2: do not edit by hand

export(parse_dates)
  • Let’s run the R CMD check again.
  • If everything works as expected, this is a good time for another commit!
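
To illustrate what an export() directive in NAMESPACE means, the sketch below inspects the stats package, which ships with R. Our own package works the same way once installed; nothing here is specific to statsAndBooze.

```r
# Exported objects are the ones users can reach via pkg::fun().
exported <- getNamespaceExports("stats")
median_is_exported <- "median" %in% exported
median_is_exported

# Everything else defined in the namespace is internal to the package.
all_objects <- ls(asNamespace("stats"), all.names = TRUE)
internal <- setdiff(all_objects, exported)
length(internal) > 0
```

A function without the @export tag stays internal: it can be used by other functions in the package, but it is not part of the user-facing interface.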

A minimal R package

  • Now we have a minimal working package! We can install it by running devtools::install() or by using the Build panel.
  • After installing the package, try this in a fresh R session:
library(statsAndBooze)
beer_dates <- list(
  ## We can see that our function works with >= 2 people and >= 2 dates
  andrea = c("2024-11-29", "2024-11-30"), 
  federico = "2024-11-30",
  chiara = "2024-11-30"
)
parse_dates(beer_dates)
  • If there are no errors, our package works 🎉!

To infinity and beyond 🚀

  • Now it’s time to expand our package! Our objective is to decide a common day for a happy hour, and we’re not doing that yet.
  • Currently, we’re only parsing input constraints into a list of Date objects. The key step, organizing the happy hour, is still missing!
  • As before, it’s convenient to start testing in an interactive session.

decide_happy_hour() function

  • How would you programmatically determine the common day in the following list?
library(statsAndBooze)
list_dates <- list(
  andrea = c("2024-11-29", "2024-11-30"), 
  federico = "2024-11-30",
  chiara = "2024-11-30"
)
parsed_dates <- parse_dates(list_dates)
decide_happy_hour <- function(x) {
  ... 
}
  • We’ll need to check which availabilities are shared among different people… see the next slide for a solution!

decide_happy_hour() function

  • Here’s a suggested solution:
decide_happy_hour <- function(x) {
  lubridate::as_date(Reduce(lubridate::intersect, x))
}
  • Test it in an interactive session.
  • If it works, create an R file (e.g., decide.R), add the function, and document it.
  • Exercise for home: Understand why we need the extra call to lubridate::as_date.
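
As a hint for the exercise, the following sketch reproduces the same idea with base R only. It uses base intersect() and as.Date() in place of the lubridate functions, so its behavior may not match the package code exactly, and the class printed in the middle can depend on your R version.

```r
availability <- list(
  andrea = as.Date(c("2024-11-29", "2024-11-30")),
  federico = as.Date("2024-11-30"),
  chiara = as.Date("2024-11-30")
)

# Reduce() folds the list pairwise:
# intersect(intersect(andrea, federico), chiara)
common_raw <- Reduce(intersect, availability)

# Inspect the class of the raw result: is it still a Date?
class(common_raw)

# Converting explicitly guarantees a Date either way.
common <- as.Date(common_raw, origin = "1970-01-01")
format(common)
```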

decide_happy_hour() function

  • Now, after completing the previous steps, load the package (note: loading is not the same as installing; see ?devtools::load_all for more details) and test it in a fresh R session:
devtools::load_all(".")
list_dates <- list(
    andrea = c("2024-11-30","2024-12-01"), 
    federico = "2024-11-30"
  )
parsed_dates <- parse_dates(list_dates)
decide_happy_hour(parsed_dates)
  • If everything is correct, rerun R CMD check. Remember, devtools::check() re-documents the package before running the checks.

decide_happy_hour() function

  • Question: What’s the expected output of the following code when there is no common date?
devtools::load_all(".")
list_dates <- list(
    andrea = "2024-11-29", 
    federico = "2024-11-30"
  )
parsed_dates <- parse_dates(list_dates)
decide_happy_hour(parsed_dates)
  • Try to formulate a hypothesis and test it by running the code.
  • If everything works as expected, commit your changes!

Reinstall and more docs

  • Now is a good time to reinstall the package and verify that everything functions as expected.
  • You should see the following:
library(statsAndBooze)
list_dates <- list(
  andrea = c("2024-12-01", "2024-12-02"), 
  federico = c("2024-12-02", "2024-12-03"), 
  chiara = "2024-12-02"
)
parsed_dates <- parse_dates(list_dates)
decide_happy_hour(parsed_dates)
[1] "2024-12-02"
  • Finish the documentation by adding examples and completing the DESCRIPTION file. Then, CHECK and commit!

Unit testing

  • The previous example informally shows that our R package works for a particular case.
  • Now we want to formalize our expectations into unit tests!
  • Why do we need unit testing? Two main reasons:
    1. We want to check for wrong inputs or edge cases that may not be obvious to users.
    2. We want to ensure that all functionalities work as expected, even after refactoring.

Unit testing (cont)

  • After loading devtools, run use_testthat() to set up the unit testing environment:
use_testthat()
✔ Setting active project to '/Users/andrea/Documents/r_packages/statsAndBooze'
✔ Adding 'testthat' to Suggests field in DESCRIPTION
✔ Setting Config/testthat/edition field in DESCRIPTION to '3'
✔ Creating 'tests/testthat/'
✔ Writing 'tests/testthat.R'
* Call `use_test()` to initialize a basic test file and open it for editing.
  • Next, use use_test(<file>) to create a new test file. For example, use_test("parse"):
use_test("parse")

Unit testing (cont)

  • Now edit the new file and write your unit test(s). Start by summarizing the objective of the test:
test_that("parse_dates(): basic functionalities work", {
    expect_equal(2 * 2, 4)
})
  • testthat provides several helper functions (expect_length(), expect_message(), expect_error(), etc.) to test aspects of your package (equality, errors, etc.).
  • Then, write the main part of the test, comparing observed output to our expected result.

Unit testing (cont)

  • For example:
test_that("parse_dates(): basic functionalities work", {
  input_strings <- list(
    andrea = "2024-12-03",
    marco = "2024-12-03"
  )
  expected_dates <- list(
    andrea = lubridate::as_date("2024-12-03"),
    marco = lubridate::as_date("2024-12-03")
  )
  expect_equal(parse_dates(input_strings), expected_dates)
})
  • After loading the package (load_all()), you can run the new test interactively like any other R function.

Unit testing (cont)

  • Repeat the same procedure for other functions:
use_test("decide")
✔ Setting active project to '/Users/andrea/Documents/r_packages/statsAndBooze'
✔ Writing 'tests/testthat/test-decide.R'
* Modify 'tests/testthat/test-decide.R'
  • The actual test might look like this:
test_that("decide_happy_hour(): basic functionalities work", {
  beer_dates <- list(
    andrea = lubridate::as_date("2024-12-03"),
    federico = lubridate::as_date("2024-12-03"),
    chiara = lubridate::as_date("2024-12-03")
  )
  expect_equal(
    decide_happy_hour(beer_dates), 
    lubridate::as_date("2024-12-03")
  )
})

Unit testing (cont)

  • You should also test edge cases that might not be exposed to regular end users.
test_that("decide_happy_hour(): empty intersection", {
  beer_dates <- list(
    andrea = lubridate::as_date("2024-12-03"),
    marco = lubridate::as_date("2024-12-04")
  )
  expect_equal(
    decide_happy_hour(beer_dates),
    lubridate::as_date(numeric(0))
  )
})
  • Similarly, test the function’s behavior with incorrect inputs, e.g., when one or more input dates are NA.
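
Before writing expectations for missing dates, it helps to decide how they should be detected. The sketch below is a base R illustration; the has_missing check is hypothetical and is not part of the package functions shown in these slides.

```r
# One person has a missing entry among the parsed dates.
parsed <- list(
  andrea = as.Date(c("2024-12-03", NA)),
  marco = as.Date("2024-12-03")
)

# Flag, for each person, whether any date is missing.
has_missing <- vapply(parsed, anyNA, logical(1))
has_missing

# A function could use such a check to warn or stop early.
any(has_missing)
```

A matching unit test would then state the agreed behavior, for instance with expect_error() or expect_warning() from testthat.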

Unit testing (cont)

  • The function test() (from devtools) runs all tests in the package and provides a summary:
test()
i Testing statsAndBooze
| F W S OK | Context
|      2 | decide
|      1 | parse

== Results ============================
Duration: 0.5 s

[ FAIL 0 | WARN 0 | SKIP 0 | PASS 3 ] 
  • The tests are also run as part of R CMD check. If all looks good, commit your changes!

Sharing your code with others!

  • Developing an R package is good practice even for personal use, but the community may benefit from your work too!
  • To make that happen, you can host your package on a service such as GitHub.
  • We do not have time to cover it here, but a good starting point is the great book Happy Git and GitHub for the useR.
  • Spoiler: it will be painful at the beginning.
  • Even more painful, but with much greater visibility, is the option of publishing your package on the Comprehensive R Archive Network (CRAN).